Fixes for forkserver/spawn serialization and fix for LMDB upgrade issues #148
christinaflo wants to merge 5 commits into main
…ng methods. LMDB refactor to allow for forkserver/spawn serialization + resolve issues that required pinning lmdb for training.
…ure connection is closed when dataset init finishes prior to forking
Minor comment: the DB preloading logic might be better handled with vmtouch, which is available on conda-forge and often present on HPC systems, falling back to reading the whole DB file only when vmtouch is not available. Conceptually, this only needs to be done once per node, since the page cache is shared across processes. With vmtouch, one could optionally use daemon mode and page locking to reduce the risk of eager eviction under memory pressure, although this may or may not be desirable depending on system limits and sharing policies (I assume we typically reserve full nodes?). Overall, I suspect the usefulness of this depends strongly on DB size relative to memory size, access locality, and competition for memory. I just found this, which could be vendored; I have never used it.
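A minimal sketch of the fallback described above, under stated assumptions: the helper names `warm_page_cache` and `read_through` are hypothetical and not part of this PR, and `path` is assumed to point at the database file itself (for a default LMDB directory layout, the `data.mdb` file inside it). It tries `vmtouch -t` when the binary is on `PATH` and otherwise reads the file sequentially.

```python
import shutil
import subprocess


def read_through(path, chunk_size=16 * 1024 * 1024):
    """Fallback: read the file sequentially so the kernel pulls it into the page cache.

    Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_page_cache(path):
    """Warm the page cache for `path` and report which method was used.

    Hypothetical helper: prefers vmtouch, falls back to a plain read.
    """
    vmtouch = shutil.which("vmtouch")
    if vmtouch is not None:
        # `vmtouch -t` touches every page of the file into memory.
        result = subprocess.run(
            [vmtouch, "-t", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return "vmtouch"
    # vmtouch is missing or failed: read the whole file instead.
    read_through(path)
    return "read"
```

Because the page cache is shared, calling this from one process per node should be enough. Daemon mode and page locking are left out here since whether they are appropriate depends on the cluster's memory limits.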
Taking a look at this now @christinaflo, because this came up in a number of different PRs, most prominently these two
jandom left a comment
I'm not sure we need this any more with the changes in #143: the two small reads from LMDB are now released immediately (via a context manager), and the persistent lmdb env stays open for the entire lifetime of the data loader (hopefully!).
That said, I'm not 100% confident; training tends to reveal more problems than unit tests do.
Do you have a repro or test that I could run to confirm that my claim is accurate?
These changes address a different issue from the one addressed in #143. The original version is not serializable under forkserver/spawn because of the persistent lmdb env, which is why the __getstate__ and __setstate__ methods are needed. In addition, forking the persistent env across workers causes odd behavior and failures with versions of lmdb > 1.6; it hasn't worked for me on multiple systems, which is why we pinned it. They recently added some documentation about this: https://lmdb.readthedocs.io/en/latest/#forking-multiprocessing
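To illustrate the serialization point, here is a minimal sketch of the general pattern, not the code in this PR. The class name `LmdbBackedDataset`, the `opener` argument, and the `open_lmdb_readonly` helper are hypothetical, and the `lmdb.open` flags shown are assumptions about a read-only training setup. The env handle is dropped when the object is pickled and reopened lazily in whichever process next touches it.

```python
class LmdbBackedDataset:
    """Hypothetical dataset; only the pickling pattern matters here."""

    def __init__(self, path, opener):
        self.path = path
        self._opener = opener  # callable(path) -> env; must itself be picklable
        self._env = None       # opened lazily, once per process

    @property
    def env(self):
        # Open on first use, so each worker creates its own handle.
        if self._env is None:
            self._env = self._opener(self.path)
        return self._env

    def close(self):
        # Release the handle, e.g. at the end of __init__ before workers start.
        if self._env is not None:
            self._env.close()
            self._env = None

    def __getstate__(self):
        # Copy the attributes but never ship the live env to another process.
        state = self.__dict__.copy()
        state["_env"] = None
        return state

    def __setstate__(self, state):
        # Restore attributes; the env stays None until first access.
        self.__dict__.update(state)


def open_lmdb_readonly(path):
    # Assumed settings for read-only access; adjust to the actual workload.
    import lmdb  # third-party, imported lazily so the sketch loads without it

    return lmdb.open(path, readonly=True, lock=False, readahead=False)
```

With this shape, `LmdbBackedDataset(path, open_lmdb_readonly)` can be handed to a data loader using forkserver or spawn: the pickled copy carries only the path and the opener, and no env opened in the parent is inherited by a worker as long as `close()` is called before the workers start.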
I've never used vmtouch before, and it isn't currently available on the cluster I'm using, but I'm happy to try it out via conda-forge. The biggest DB we have is around 10 GB. On one system, I did notice that I had to "rewarm" the cache between epochs, so this could be useful to avoid that.
Summary
LMDB refactor to allow for forkserver/spawn serialization and resolve issues that required pinning to LMDB==1.6.2 for training.
Changes
Related Issues
PR #143 defaults to forkserver. This PR adds additional fixes needed for training with forkserver.